ABSTRACT


The building sector is a major source of energy consumption and greenhouse gas emissions in urban 
regions. Several studies have explored energy consumption prediction, and the value of the knowledge 
extracted is directly related to the quality of the data used. The massive growth in the scale of data affects 
data quality and poses a challenge to traditional data mining methods, as these methods have difficulties 
coping with such large amounts of data. Expanded algorithms need to be utilized to improve prediction 
performance considering the ever-increasing large data sets. 

In this paper, a preprocessing method to remove noisy features is coupled with prediction methods to 
improve the performance of the energy consumption prediction models. The proposed preprocessing 
method is based on the well-known principal component analysis (PCA) and treats the historical 
meteorological and energy data of buildings. The cleaned and processed data are used in five prediction 
models including linear regression, support vector regression, regression tree, random forest and K- 
nearest neighbors. 

The proposed methodology is applied to four case studies with different climate zones (cold, mild, 
warm-dry and hot-humid) to study the effect of dataset patterns on the feature reduction and prediction 
performance. The results show that the proposed method enables practitioners to efficiently acquire a 
smart dataset from any big dataset for energy consumption prediction problems. In addition, the best 
prediction model for each climate zone is proposed, considering the mean square error, R², residual values and 
execution time. 


1. Introduction 


The building sector is one of the largest consumers of energy 
(39–40%) and emitters of greenhouse gases (38–39%) in the 
world (Becerik-Gerber et al., 2013). Energy consumption prediction 
(Pham et al., 2020) and monitoring (Parhizkar et al., 2019, 2020) in 
buildings helps to increase the effectiveness and efficiency of de- 
cisions made to reduce energy demand and carbon emissions. 
Energy consumption models are mostly used in the first step of 
energy management and efficiency improvement models, such as 
optimizing operations and reducing costs (Li et al., 2020); sizing 
thermal energy storage to improve energy efficiency (Lin et al., 
2019); modeling heating, ventilation and air conditioning (HVAC) 
systems to reduce energy consumption (Chen et al., 2020); and 
planning urban energy systems (Moghadam et al., 2017). 

Two approaches have been used to predict energy consumption: 
principle-based modeling (white box) and data-driven modeling 
(black box) (Guermoui et al., 2020). In the principle-based 
approach, the inputs can be weather features, geographic loca- 
tions, building designs, building material properties, occupancy 
characteristics and operating schedules, and the outputs are 
building load estimates (Gan et al., 2020). EnergyPlus, Ecotect, and 
eQuest are energy simulation software tools used to predict 
buildings’ energy consumption. Principle-based models have a high 
accuracy in modeling buildings’ energy consumption (Gan et al., 
2020). However, due to lack of access to building design and ma- 
terial characteristics, principle-based approaches can be uncertain 
and time consuming. 

In contrast, data-driven models, which have received attention in 
recent years, do not require the building design and material 
characteristics. These models base their predictions on the system's 
historical data (Parhizkar et al., 2017, 2018). One of the basic 
data-driven prediction methods is the linear regression model. In 
(Iwafune et al., 2014), a linear regression method is utilized to predict 
the daily energy consumption of residential buildings. Outside 
temperature and date are considered as the two factors affecting energy 
consumption. The model is used to perform demand-side 
management of residential buildings. In (Zhang et al., 2020), a 
multivariate regression method is used to predict energy consumption for 
the optimal design of the building environment. In (Alobaidi et al., 2018), a 
regression model for predicting the average daily energy con- 
sumption of individual households is proposed. This framework 
utilizes information diversity to predict the day-ahead average 
energy consumption. To further enhance generalizability, a robust 
regression component was proposed. The proposed method was 
applied to a case study in France, and the results illustrate signifi- 
cant improvement in alleviating the unstable prediction problems 
that exist in other models. Afroz et al. (2018) developed an indoor 
temperature prediction model for commercial buildings. In this 
study, a nonlinear autoregressive network that considers exoge- 
nous input-based system identification was proposed to predict 
indoor temperatures. The optimal input parameters, size of 
network, and size of training data affected the performance of the 
model. Using sensitivity analysis, the researchers proposed a model 
that provided an accurate prediction for up to 28 days ahead. 

In addition to regression methods, other machine learning al- 
gorithms can be used to predict building energy consumption 
(Zhou et al., 2016). Artificial neural networks (ANNs) are among the 
most popular data-driven energy prediction models. They are 
designed based on the basic functions of the human brain, with 
processing units that play the role of biological neurons. A network 
consists of one or more processing units arranged in layers, which are 
linked via connections. The method is presented comprehensively in 
(Bagnasco et al., 2015). Using this method, the hourly overall 
(Ilbeigi et al., 2020), cooling (Luo, 2020) and heating (Bui et al., 
2020) energy consumption of buildings could be predicted. In 
(Rahman et al., 2018), a recurrent neural network model is pro- 
posed to predict the energy consumption of commercial and resi- 
dential buildings. In this study, a deep neural network was used to 
perform imputation on datasets containing segments of missing 
values. The method was applied to datasets for a commercial 
building and a residential building. The results illustrate that the 
recurrent neural network model corresponded to a lower relative 
error compared to a conventional multi-layered perceptron neural 
network in the commercial building. However, in the residential 
building, the proposed model did not provide high accuracy in 
comparison to the multi-layered perceptron model. 

Another popular data-driven method is the support vector 
regression (SVR). Multiple studies have used this method to predict 
hourly cooling (Li et al., 2009), heating (Chou and Bui, 2014) and 
overall (Shao et al., 2020) energy consumption of buildings. For 
instance, Ma et al. (2019) proposed an SVR model to predict 
building energy consumption in southern China. Multiple features, 
including weather data and economic factors, were taken as inputs, 
and the prediction model performance was evaluated using data 
provided by the Chinese National Bureau for four provinces of 
southern China. The results indicated that the SVR method has a 
high accuracy in predicting building energy consumption. 

Random forest is one of the most widely used decision tree 
methods in the field of building energy consumption prediction. In 
(Fan et al., 2014), the daily energy consumption of a non-residential 
building is predicted using the random forest method. Maximum 
dry-bulb temperature, average dry-bulb temperature, minimum 
dry-bulb temperature, average dew point temperature, average 
relative humidity, average pressure, average amount of cloud, total 
rainfall, number of hours of reduced visibility, solar radiation, total 
evaporation and average wind speed are considered as affecting 
factors. One year of data is used to train the model, resulting in a 
3.17% mean absolute percentage error. K-nearest neighbors is 
another statistical algorithm that has been used in the same study for 
energy consumption prediction. This method could predict the overall 
energy consumption of the building with a 4.01% mean absolute 
percentage error. 

There are several studies that have compared data-driven 
models in different case studies. For instance, Candanedo et al. 
(2017) presented a data-driven predictive model to predict elec- 
tricity loads. This study compared four data-driven methods: 
multiple linear regression (MLR), support vector machine (SVM) 
with radial kernel, the random forest approach and gradient 
boosting machines (GBMs). The results showed that the GBM 
method, with an explained variance (R²) of 97% in the training set and 57% in 
the testing set, was the most efficient when all predictors were 
used. Guo et al. (2018) compared four machine learning methods to 
predict the energy demand of building heating systems: support 
vector regression (SVR), MLR, the extreme learning machine 
approach and a backpropagation neural network. Data on building 
heating using a ground source heat pump system were used to test 
and compare the performances of the models, which take meteo- 
rological parameters, operating parameters, time and indoor tem- 
perature parameters as inputs. Their results indicated that the 
performance of an extreme learning machine model with 11 hid- 
den layer nodes and feature set 4 is better than the other methods. 
In (Gassar et al., 2019), multiple machine learning methods for 
predicting gas and electricity consumption in London’s residential 
buildings are compared. The study considered the multilayer neural 
network (MNN), MLR, random forest and gradient boosting (GB) 
methods, and the input features examined were socio- 
demographic, economic and building characteristics. The results 
show that household income, number of households and building 
characteristics are the most important features of gas and elec- 
tricity consumption. The MNN models outperformed the MLR, 
random forest and GB models at predicting energy consumption by 
London’s residential buildings. In another study conducted by 
Gungor et al. (2019) four years of household electricity consump- 
tion data are used to compare various machine learning methods, 
including the random forest, K-nearest neighbors (KNN), stochastic 
gradient descent, logistic regression and SVM approaches. The re- 
sults indicated that the random forest method had the lowest 
prediction error. 

In almost all of these studies, an efficient method for building 
energy consumption prediction is proposed according to the results 
of a case study. However, prediction methods are highly dependent 
on the historical data of the buildings studied, and it is not possible 
to propose a general method applicable to all buildings of the same 
type (commercial, residential, etc.). In our study, the dependence of 
the method's performance on the energy consumption data pattern 
is clarified. 

Fig. 2. A PCA loading plot as an example. 

In addition, most of the reviewed literature has focused on using 
data-driven methods to predict energy consumption. To develop an 
accurate model, most of the building features should be considered, 
including the time, outside dry bulb temperature, outside dew 
point temperature, direct normal solar radiation, diffuse horizontal 
solar radiation, wind speed, wind direction, atmospheric pressure, 
solar azimuth, relative humidity, air temperature, radiant 
temperature and operative temperature. Technology that can accurately 
monitor, collect and store the vast amount of data involved in this 
process is now available. Recent literature on this topic indicates 
that while the available methods are able to predict energy 
consumption, there is still room for improvement. Specifically, as the 
amount of data increases over time, the available methods tend to 
predict energy consumption with lower accuracy or much higher 
execution times. As a result, preprocessing techniques should be 
utilized to aid data processing in prediction models (Lin et al., 
2020). In this study, the principal component analysis (PCA) 
method was used to identify the features with the strongest effect 
on energy consumption prediction. PCA is a dimensionality-reduction 
method that transforms a large set of variables into a smaller one 
that still contains most of the information in the large set. 

Fig. 3. A simple linear regression plot for an example scatter plot. 
Fig. 4. Threshold and error definition in the SVR method for an example scatter plot. 

This study utilizes the PCA as a preprocessing method to assist 
five energy consumption prediction models, widely used in this 
field. The integrated PCA based prediction models are applied to 
four types of energy data patterns, and the results are compared. 
Results show that in all cases, the PCA method helps to identify the 
most important features for energy consumption prediction and 
improves prediction model performance. The main contributions of 
this research could be summarized as follows: 
1. In this study, the dependence of prediction models' performance 
on the historical data, in addition to the building type, is 
illustrated. In other words, this study demonstrates that in 
addition to the building type, the climate zone should be 
considered in selecting the prediction model. This finding could 
assist in the development of hybrid methods in online prediction 
models that could be updated over time based on a building's 
historical data (i.e., the energy prediction model of the 
building could be continuously updated and changed in accordance 
with the data gathered over time). 

Fig. 6. Four main steps of the KNN algorithm. 


2. As mentioned, there is a growing need for energy prediction 
models that can monitor and analyze building energy performance. 
However, the continuously growing complexity of energy systems 
and the ever-increasing amount of data make this 

Table 1 
The construction properties of the studied building. 

Component    Construction properties 
Wall         Gypsum (13 mm) + concrete block (100 mm) + polystyrene (20 mm) + brickwork outer (100 mm) 
Roof         Gypsum (13 mm) + air gap (25 mm) + concrete (300 mm) + wool + asphalt (10 mm) 
Window       Generic clear (6 mm) + air (6 mm) + generic blue (6 mm) 

Fig. 8. Relative humidity (a), dry bulb temperature (b), dew point temperature (c), ambient temperature (d), wind speed (e) and atmospheric pressure (f) over a year in the studied 
climate zones. 
Fig. 9. Occupancy (a) and lighting (b) daily profiles. 


to four different climate zones. On that basis, the most efficient 
PCA-based energy prediction method for each climate zone is 
proposed. 


2. Research framework and methodology 


Fig. 1 presents the data flow diagram of the PCA-based energy 
consumption prediction method, displaying its two main steps. In 
the first step, the historical data of a building are taken as inputs for 
the PCA model. The historical data consist of 20 features, as 
presented in Fig. 1. In the PCA model, the most influential features are 
selected based on the model's loading factors and problem restrictions. 
In the second step, the reduced features are utilized to predict 
energy consumption using multiple prediction models, including 
linear regression, SVR, regression tree, random forest and KNN 
models. In this step, the most efficient prediction model is selected 
based on factors such as the mean square error (MSE) and R² values. 
Finally, the selected model is utilized to predict the hourly energy 
consumption of the building. 
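
The two selection metrics named above can be sketched as follows. This is a minimal illustration using the standard definitions of MSE and R²; the function names and the sample values are hypothetical and are not taken from the case studies.

```python
def mse(actual, predicted):
    """Mean square error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly loads (kW) and model outputs, for illustration only.
observed = [10.0, 12.0, 15.0, 11.0]
estimated = [11.0, 12.0, 14.0, 11.0]

print(mse(observed, estimated))        # squared-error units, here (kW)^2
print(r_squared(observed, estimated))  # dimensionless, at most 1
```

A lower MSE and an R² closer to 1 both indicate a better fit, so the two metrics can be read together when ranking candidate models.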
Fig. 10. Hourly heating energy consumption in the case studies. 
Fig. 11. Hourly cooling energy consumption in the case studies. 


2.1. The principal component analysis (PCA) model 


As shown in Fig. 1, the features of the historical meteorological 
data were extracted using statistical methods. These data comprise 20 
features related to a building's energy consumption. 
The historical values of these features were used as inputs in the 
PCA model. 

Principal component analysis is a mathematical procedure that 
transforms a number of (possibly) correlated variables into a 
smaller set of uncorrelated variables called principal components 
(Skjærvold et al., 2006). As one of the effective factor analysis 
methods, PCA is widely used to reduce the number of variables 
under study and, consequently, in the ranking and analysis of 
decision-making units (Ghaderi et al., 2006). 

Primarily, PCA decomposes data matrix X into a structure part and a 
noise part. As presented in Eq. (1), data matrix X (n × k) is split into 
a modelled part My (n × k) and a residual error part E (n × k) 
(Huang et al., 2019). 


X = My + E    (1) 


The modelled part of X is expressed as a subspace with 
dimensionality A, where A represents the number of principal 
components. Consequently, when the chosen model dimension- 
ality A is changed, the error content also varies. 
The historical data determine how many principal components 
are needed to have an accurate prediction model. The principal 
components are used to reduce features to a smaller number. The 
reduction process can be performed using a PCA loading plot of 
principal components (Gao et al., 2013). A loading plot shows how 
strongly each characteristic influences a principal component. 

Fig. 2 presents a loading plot for a case as an example. Loadings 
can range from −1 to 1. Loadings close to −1 or 1 indicate that the 
variable strongly influences the component. Loadings close to 
0 indicate that the variable has a weak influence on the component. 
Evaluating the loadings can also help one to characterize each 
component in terms of the variables. 

In this example, the ambient temperature has a large positive 
loading and the solar radiation has a large negative loading on 
components 1 and 2. As a result, these features are more critical 
than the atmospheric pressure, which has a low loading value. 
Hence, atmospheric pressure can be eliminated when the data are 
reduced. 
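
How such loadings arise can be illustrated with a small sketch. It assumes that PCA is run on the correlation matrix of standardized features and that the first component is found by power iteration; the paper does not state which implementation was used. The data are contrived so that the result mirrors the example above, and all names and values are hypothetical.

```python
import math


def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled.append([(v - mean) / std for v in col])
    return scaled


def correlation_matrix(columns):
    """Pairwise correlations between the feature columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def first_component_loadings(columns, iterations=500):
    """Loadings of each feature on PC1, via power iteration."""
    corr = correlation_matrix(columns)
    k = len(corr)
    # Asymmetric start vector, so it is unlikely to be orthogonal to PC1.
    vec = [float(i + 1) for i in range(k)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        eigenvalue = math.sqrt(sum(p * p for p in product))
        vec = [p / eigenvalue for p in product]
    if vec[0] < 0:  # the sign of a component is arbitrary; fix it by convention
        vec = [-v for v in vec]
    # Eigenvector entry times sqrt(eigenvalue) is the correlation between the
    # feature and the component, so every loading lies between -1 and 1.
    return [v * math.sqrt(eigenvalue) for v in vec]


# Contrived, hypothetical readings (not taken from the case studies).
ambient_temperature = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
solar_radiation = [600.0, 500.0, 400.0, 300.0, 200.0, 100.0]
atmospheric_pressure = [1013.0, 1011.0, 1012.0, 1012.0, 1011.0, 1013.0]

loadings = first_component_loadings(
    [ambient_temperature, solar_radiation, atmospheric_pressure]
)
print(loadings)
```

With these contrived inputs the first two loadings have magnitude close to 1 with opposite signs and the third is close to 0, which is the pattern that marks a feature such as atmospheric pressure as a candidate for removal.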


2.2. Linear regression 


Linear regression is a basic and commonly used predictive 
model. Linear regression consists of finding the best-fitting straight 
line through the points. The best-fitting line is called a regression 
line. Fig. 3 shows a scatter plot of sample data. The red diagonal line 
is the regression line and consists of the predicted Y value for each 
possible value of X. As shown, Y can be predicted as a function of X. 
More precisely, each data point can be represented by the linear 
equation plus an error term. The distance between the data and the line 
represents the prediction error. The linear regression algorithm 
finds the constant values of the linear equation by minimizing 
these errors for all sample data (Harrell, 2015). 
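
As a minimal sketch of this fitting step, the closed-form ordinary least squares solution for a single predictor is shown below. It assumes that the minimized quantity is the sum of squared errors, which is the usual choice; the variable names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical outside temperatures (°C) and energy loads (kW).
temperatures = [0.0, 10.0, 20.0, 30.0]
loads = [5.0, 9.0, 15.0, 19.0]

intercept, slope = fit_line(temperatures, loads)
# Prediction errors: vertical distances between the data and the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(temperatures, loads)]
print(intercept, slope, residuals)
```

For a least squares line with an intercept, the residuals sum to zero, which offers a quick sanity check on the fit.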


2.3. Support vector regression (SVR) 


Fig. 16. Loading (a, b), cumulative explained variance (c) and scatter score (d) plots for the Bandar Abbas dataset. 

As the name suggests, SVR is a modified regression algorithm. In 
simple regression, we try to minimize the error rate, while with 
SVR we try to fit the error within a certain threshold. Depending on 
the regression model, SVR can be linear or nonlinear. Fig. 4 
presents a linear SVR example. As can be seen, the error is 
calculated from the threshold lines. This is the main difference between 
the regression method and SVR, but there are other rules that 
differentiate these two methods (Smola and Schölkopf, 2004). 
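
The threshold idea can be made concrete with the epsilon-insensitive loss commonly associated with SVR: points inside the band around the prediction cost nothing, and points outside are charged only for the part beyond the threshold line. This is a sketch of the loss alone, not of a full SVR solver; the value of epsilon and the data are hypothetical.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Per-point error measured from the threshold lines at +/- epsilon."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]


# Hypothetical loads (kW) and predictions, with a threshold of 1 kW.
observed = [10.0, 12.0, 15.0]
estimated = [10.5, 14.0, 15.0]

losses = epsilon_insensitive_loss(observed, estimated, epsilon=1.0)
print(losses)  # [0.0, 1.0, 0.0]
```

Only the second point lies outside the band, by 1 kW, so it is the only one that contributes to the loss.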


2.4. Regression tree 


A regression tree builds a regression model in the form of a tree 
structure. It breaks down a dataset into smaller and smaller subsets, 
while at the same time incrementally developing an associated 
decision tree. The final result is a tree with decision nodes and leaf 
nodes. A decision node (e.g., energy consumption) has two or more 
branches (e.g., low, medium and high), each representing values for 
the attribute tested. A leaf node represents a decision on the 
numerical target (Loh, 2008). 

Fig. 17. Loading (a, b), cumulative explained variance (c) and scatter score (d) plots for the Yazd dataset. 

The regression tree algorithm has two main parts. First, a 
decision tree is developed; in this step, the data are branched according to 
some indexes. Next, a regression model is developed 
for each branch of the decision tree. This step mostly follows 
regression model rules. Fig. 5 presents a simple example of a 
cooling energy consumption prediction model that follows the 
regression tree method. This example consists of four buildings 
with different characteristics (with and without radiation shielding 
windows) and different climate zones (high and low ambient 
temperature, with and without high solar radiation). The energy 
consumption of buildings A and D is presented in the figure. 
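
The two parts can be sketched for a single split on one feature. The sketch assumes that the branching index is the total squared error of the two branches and that each branch predicts the mean of its samples, which is the simplest possible per-branch model; neither choice is confirmed by the paper, and the data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Threshold on x that minimizes the squared error of the two branches."""
    best_threshold, best_error = None, float("inf")
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold


def predict(x, threshold, xs, ys):
    """Mean target of the training samples that fall in the same branch as x."""
    branch = [yi for xi, yi in zip(xs, ys) if (xi <= threshold) == (x <= threshold)]
    return sum(branch) / len(branch)


# Hypothetical ambient temperatures (°C) and cooling loads (kW).
ambient_temp = [20.0, 22.0, 24.0, 34.0, 36.0, 38.0]
cooling_load = [2.0, 2.5, 3.0, 9.0, 9.5, 10.0]

threshold = best_split(ambient_temp, cooling_load)
print(threshold)
print(predict(23.0, threshold, ambient_temp, cooling_load))
print(predict(35.0, threshold, ambient_temp, cooling_load))
```

A full regression tree repeats the split search recursively inside each branch and across all features until a stopping rule is met.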


2.5. Random forest 


Random forest is an ensemble method developed by constructing 
a multitude of decision tree models during the training 
phase, each from a randomly chosen subset of the training set, to obtain a 
better predictive performance. Ensemble models combine the 
results from diverse models, and the results from an ensemble model are 
usually better than the results from an individual model. A subset of 
the data is attributed to each tree in the random forest algorithm 
(Liaw and Wiener, 2002). For example, if the dataset contains 2000 rows 
(samples) and 50 columns (features), a subset of 200 rows and 20 
columns may be assigned to each tree. Each tree is trained on its own 
data subset, and the model then aggregates the votes from the different 
decision trees to make the final decision for the test object. 
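
The subsetting in this example can be sketched as follows. The sketch assumes that rows are drawn with replacement and columns without replacement, and that for a numerical target the trees' outputs are averaged; these are common choices rather than details given in the paper. The function names are hypothetical, and no trees are actually trained here.

```python
import random


def draw_tree_subset(n_rows, n_cols, rows_per_tree, cols_per_tree, rng):
    """Pick the row and column indices handed to one tree."""
    rows = [rng.randrange(n_rows) for _ in range(rows_per_tree)]  # with replacement
    cols = rng.sample(range(n_cols), cols_per_tree)  # without replacement
    return rows, cols


def aggregate(tree_predictions):
    """Combine per-tree outputs; for a numerical target, a plain average."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# 2000 samples and 50 features, with 200 rows and 20 columns per tree.
subsets = [draw_tree_subset(2000, 50, 200, 20, rng) for _ in range(10)]

rows, cols = subsets[0]
print(len(rows), len(cols))
print(aggregate([1.0, 2.0, 3.0]))
```

Because every tree sees a different slice of the data, the individual errors are less correlated, which is what makes the aggregated result more stable than a single tree.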


2.6. K-nearest neighbors (KNN) algorithm 


The KNN algorithm is a non-parametric method used for data 
prediction. In this algorithm, data are labeled (Soucy and Mineau, 
2001) and prediction is performed based on the labels. Fig. 6 
presents a simple example of the KNN algorithm. In the first step, the 
Euclidean or Mahalanobis distance from the query example to the 
labeled examples is calculated. In the figure, these distances are 
shown by light blue arrows. Next, the labeled examples are ordered 
by increasing distance (for instance, the label represented by 
triangles in the solid-line circle). As K increases, the outcome can 
change, because the larger dashed-line circle contains more squares. 
In the third step, a heuristically optimal number K of nearest 
neighbors is found based on the RMSE; this process is 
carried out using cross-validation. Last, an inverse-distance-weighted 
average of the K nearest multivariate neighbors is 
calculated, and the unknown data are labeled accordingly. 
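
The distance, ordering and weighting steps can be sketched as below, using the Euclidean distance and a fixed K. The cross-validated choice of K is omitted, and the feature values and targets are hypothetical.

```python
import math


def knn_predict(query, points, targets, k):
    """Inverse-distance-weighted average of the k nearest training targets."""
    # Steps 1 and 2: compute distances and order the examples by distance.
    distances = sorted((math.dist(query, p), t) for p, t in zip(points, targets))
    nearest = distances[:k]
    # An exact match has zero distance; return its target to avoid dividing by zero.
    for d, t in nearest:
        if d == 0.0:
            return t
    # Last step: closer neighbors receive proportionally larger weights.
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


# Hypothetical (scaled) feature pairs and energy loads (kW).
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (10.0, 10.0)]
targets = [10.0, 20.0, 40.0, 100.0]

prediction = knn_predict((1.0, 0.0), points, targets, k=2)
print(prediction)
```

Here the two nearest neighbors lie at distances 1 and 2, so their weights are 1 and 0.5 and the prediction is (10 + 0.5 × 20) / 1.5, or about 13.3 kW.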


3. Case study 


To study the effectiveness of the proposed method, we applied 
it to four datasets, which allows the sensitivity of 
the method to different input datasets to be assessed. The four selected 
datasets are weather and energy data of buildings in four different 
climate zones. 

Datasets include hourly weather (relative humidity, dry bulb 
temperature, dew point temperature, ambient temperature, wind 
speed and atmospheric pressure) and hourly energy consumption 
(cooling and heating) data. In this section, the building character- 
istics, climate conditions of the studied zones and energy con- 
sumption of the office buildings located in these zones are 
presented. 


3.1. Building information 


An office building in four different climate zones, as shown in 
Fig. 7, was considered as a case study to evaluate the performance of 
the proposed methodology. 

The size of the building is 21.80 m × 22 m × 3.5 m. The solar heat 
gain coefficient (SHGC) and glazing U-value are 0.503 and 3.094 
W/(m²·K), respectively. The window-to-wall ratio is 21%. The 
occupancy density, lighting power density and equipment power 
density are 18.6 W/m², 19.5 W/m² and 10.8 W/m², respectively. 
The HVAC system of the building is a four-pipe fan coil unit. Four- 
pipe systems have separate heating and cooling fan coil units, 
which means that hot or chilled water is always available, enabling 
the system to immediately change over from heating to cooling 
mode. The construction properties of the building are summarized 
in Table 1. 
The least influential variables presented in PCA loading plots. 

Table 3 
Comparison of five prediction models for the Tehran dataset. 

Prediction model     MSE (kW)²           R²                  Time (s) 
                     Train     Test      Train     Test 
Linear regression    11.56     12.69     0.68      0.64      1.66 
SVR                   9.02     12.59     0.75      0.64      10.62 
Regression tree       0.67      1.18     0.98      0.97      0.94 
Random forest         0.15      0.39     1.00 a    1.00 b    6.45 
KNN                  10.20     15.60     0.72      0.56      1.18 

a The actual value is 0.99712, which is rounded up to 1. 
b The actual value is 0.99601, which is rounded up to 1. 

Table 4 
Comparison of five prediction models for the Tabriz dataset. 

Prediction model     MSE (kW)²           R²                  Time (s) 
                     Train     Test      Train     Test 
Linear regression    18.96     20.59     0.67      0.64      1.66 
SVR                  13.68     20.12     0.76      0.65      10.98 
Regression tree       0.93      2.41     0.98      0.96      0.93 
Random forest         0.24      1.10     0.99      0.98      7.13 
KNN                  15.62     24.45     0.72      0.57      1.20 


3.2. Weather data 


Four climate zones are considered to compare the effectiveness 
of the method. The hourly climate data (relative humidity, dry bulb 
temperature, dew point temperature, ambient temperature, wind 
speed and atmospheric pressure) of these four zones are derived 
from a weather forecast website.¹ Fig. 8(a) displays the hourly 
relative humidity variation over a year in different cities. The 
highest humidity level is seen in Bandar Abbas in warm seasons. In 
general, as demonstrated, while Bandar Abbas is the most humid 
city among those studied, Yazd normally experiences the driest 
weather conditions during the year. Moreover, the range of relative 
humidity change is more visible in Bandar Abbas and Tabriz than in 
the other cities, and it is most clearly visible in Bandar Abbas. 

¹ https://www.weather-forecast.com/countries/Iran. 

The hourly dry bulb, dew point and air temperature variations 
over a year in different cities are shown in Fig. 8(b), (c) and (d). 
Table 5 
Comparison of five prediction models for the Bandar Abbas dataset. 

Prediction model     MSE (kW)²           R²                  Time (s) 
                     Train     Test      Train     Test 
Linear regression     5.05      5.64     0.93      0.92      2.01 
SVR                  11.63     15.91     0.86      0.79      10.33 
Regression tree       0.48      0.68     0.99      0.99      1.01 
Random forest         0.13      0.28     1.00 a    1.00 b    6.25 
KNN                  15.66     24.03     0.79      0.68      1.26 

a The actual value is 0.99830, which is rounded up to 1. 
b The actual value is 0.99691, which is rounded up to 1. 

Table 6 
Comparison of five prediction models for the Yazd dataset. 

Prediction model     MSE (kW)²           R²                  Time (s) 
                     Train     Test      Train     Test 
Linear regression    10.17     11.00     0.75      0.72      1.99 
SVR                   9.68     14.57     0.76      0.66      10.94 
Regression tree       0.48      1.03     0.99      0.97      1.15 
Random forest         0.13      1.00     0.28      0.99      6.24 
KNN                  10.93     17.78     0.73      0.55      1.33 

Fig. 18. The execution time (a), MSE (b) and R² (c) of the linear regression model for the 
three datasets and four climate zones. 


Bandar Abbas has the highest dry bulb temperature, followed by 
Yazd, Tehran and Tabriz. While the warmest time of the year is seen 
in Bandar Abbas in July, the coldest time of year among the selected 
cities is found in Tabriz in January. It should be noted that in 
addition to the ambient temperature, seasonal temperature 
differences also influence the total energy consumption of buildings. 
The largest seasonal temperature difference is observed in Tabriz, 
with a 27 °C temperature difference between winter and summer. 
Bandar Abbas experiences the lowest seasonal temperature 
difference, with an almost 18 °C temperature difference between 
winter and summer. However, the temperature range changes in 
Tehran and Yazd are approximately identical. 

As can be seen in Fig. 8(e), wind speed, unlike other 
environmental parameters, does not follow a specific pattern. For instance, 
in Tehran, the greatest variation in wind speed occurs between February and 
August, whereas in Yazd, it occurs between January and June. This 
phenomenon is explained by the geographic location of Iran. 


3.3. Behavioral patterns 


Daily occupancy and lighting profiles of a real office building 
are used to set the schedules that influence the energy 
consumption of the building. The specific behavioral patterns for 
the occupancy (Duarte et al., 2013) and lighting (Li and Lam, 2001; 
Jiang et al., 2018) profiles are shown in Fig. 9(a) and (b). 


3.4. Building energy data 


The monthly electricity and gas consumption data of the 
buildings are collected from the available electricity and gas bills for 
the past 4 years. Working days, the number of people in each building, 
and the type of cooling/heating systems are determined using 
questionnaires filled in by the buildings' users. Lastly, 
the collected energy consumption data are converted from 
a monthly to an hourly basis. 

Figs. 10 and 11 show the hourly energy consumption for heating 
and cooling in the studied climate zones. As indicated in the figures, 
climatic conditions have a significant impact on building energy 
consumption. In this study, four cities with different climatic fea- 
tures were selected to demonstrate the impact of climate and 
building characteristics on energy consumption prediction models. 

Bandar Abbas consumes the least heating and the most cooling 
energy among all studied cities because of its hot and humid 
climate. In contrast, as Tabriz is a cold and dry city, it consumes the 
most heating energy and the least cooling energy. In addition, all 
cities follow the same pattern of cooling and heating energy 
consumption over a year. 


4. Results 
4.1. Principal components identification 


As indicated, the goal of the PCA model is to find a subset of 
reduced size with new variables in which the projected individuals 
retain their initial structures with the least possible distortion 
(Jiang et al., 2018). The PCA algorithm in this paper was used to 
identify the main components in the energy consumption of 
buildings based on available historical data. Fig. 12 demonstrates 
the loading plots for the Tehran data. As noted above, the loading 
plot shows the highly correlated parameters and the most influential 
factors for energy consumption. A larger loading factor indicates that 
the parameter contributes more to the principal components (PCs); as a 
result, the parameter has a greater effect on the energy consumption 
and prediction model. 
Fig. 12(a) shows the load of all variables on two main principal 
components (PC1 and PC2). Fig. 12(b) shows PC3 versus PC1. It was 
assumed that these three principal components are the main 
components in energy consumption prediction. Although this 
assumption could reduce the accuracy of the modeling, the aim of 
this research was to demonstrate the effectiveness of the proposed 
methodology. 

Fig. 13 shows that PC1, PC2 and PC3 together explain 57% of the
variance in the Tehran historical data. Although a share of roughly
60% is low for representing the behavior of the system, it is
sufficient to determine the most and least influential factors.
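
As an illustration of how loadings and the cumulative explained variance of the first three PCs can be obtained, the following is a minimal sketch using NumPy on synthetic data. The function name `pca_summary`, the synthetic feature matrix and the decision to work on the correlation matrix (i.e., standardized features) are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np

def pca_summary(X, n_components=3):
    """Return eigenvector loadings and cumulative explained variance ratios."""
    X = np.asarray(X, dtype=float)
    # Correlation matrix: equivalent to PCA on standardized features,
    # so that features with large units do not dominate the components.
    corr = np.corrcoef(X, rowvar=False)
    # The matrix is symmetric, so eigh applies; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    cumulative = np.cumsum(explained)[:n_components]
    # Loadings are taken here as eigenvector coefficients
    # (rows: features, columns: PCs); other conventions rescale them.
    loadings = eigvecs[:, :n_components]
    return loadings, cumulative

# Synthetic stand-in for hourly records: 500 samples, 6 hypothetical features,
# three of which are noisy linear combinations of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
mixed = base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(500, 3))
X = np.hstack([base, mixed, rng.normal(size=(500, 1))])

loadings, cumulative = pca_summary(X)
print("Cumulative explained variance of PC1..PC3:", np.round(cumulative, 3))
```

The last entry of `cumulative` plays the role of the share reported for PC1 to PC3 in the text.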
Fig. 19. Effectiveness of the data reduction methods as compared to the base case for the linear regression model. 


The scatter score plot of PC2 versus PC1 for the Tehran dataset is
presented in Fig. 14. A scatter score plot can reveal
groupings in the data. For instance, in the presented figure, the
dataset could be split into three or four distinct groups. This sug- 
gests that it is possible to classify the data based on these groups. 
This possibility of using PCA for classification forms the basis for a 
classification method called soft independent modelling of class 
analogies (SIMCA). In the SIMCA method, for each data group (represented
by different colors on the scatter score graph), a specific
energy consumption prediction model could be developed that has 
a lower residual in comparison to a prediction model applied to the 
entire dataset. 

Figs. 15–17 show the loading, cumulative explained variance
and scatter score plots for Tabriz, Bandar Abbas and Yazd, respec- 
tively. The point labels in all loading plots are the same as in Fig. 12. 

The least influential parameters of all climate zones were 
derived from the loading plots and are presented in Table 2. The 
first column shows the least influential parameters in the PC1 and 
PC2 loading plot; the second column shows the least influential 
factors in the PC1 and PC3 loading plot; and the last column shows 
the common parameters, presented in both column 1 and column 
2. As these columns show the least influential factors of PC1, PC2 
and PC3, these parameters can be eliminated in energy prediction 
models to improve the execution time. 
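
The selection logic behind Table 2 can be sketched as a set intersection. In the study the least influential parameters are read from the loading plots; the numerical criterion below (distance from the origin of the loading plot under a threshold), the threshold value, the feature names and the loading values are all hypothetical and serve only to illustrate the idea.

```python
import math

def least_influential(loadings, pc_x, pc_y, threshold):
    """Names of features lying within `threshold` of the origin
    in the loading plot of component pc_x versus component pc_y."""
    return {
        name for name, coords in loadings.items()
        if math.hypot(coords[pc_x], coords[pc_y]) < threshold
    }

# Hypothetical loadings on (PC1, PC2, PC3); not the values from the paper.
loadings = {
    "ambient_temperature": (0.52, 0.10, 0.05),
    "relative_humidity":   (0.31, -0.44, 0.12),
    "wind_speed":          (0.04, 0.06, 0.41),
    "wall_u_value":        (0.03, -0.02, 0.05),
    "occupancy":           (0.45, 0.38, -0.20),
}

low_pc1_pc2 = least_influential(loadings, 0, 1, threshold=0.1)  # first column
low_pc1_pc3 = least_influential(loadings, 0, 2, threshold=0.1)  # second column
common = low_pc1_pc2 & low_pc1_pc3                              # last column
print(sorted(common))
```

With these made-up values, `wind_speed` is weak in the PC1 and PC2 plot but strong on PC3, so only `wall_u_value` is a candidate for elimination.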


4.2. Comparison of building energy consumption prediction models 


Energy consumption was forecasted using five prediction 
models, namely linear regression, SVR, regression tree, random 
forest and KNN models. The observed data of each city are split into
training (80%) and test (20%) datasets. The test dataset is used to
provide an unbiased evaluation of the prediction models’ perfor- 
mance. In order to study the effectiveness of the model in all 
weather conditions, 20% of the observed data for each city during 
each month is randomly selected to be used in the test dataset. The 
rest of the data of each month (80%), hence 80% of all observed data, 
are used as a training dataset. 
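
The month-wise random split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the record layout (a `month` key per hourly record), the function name `split_by_month` and the fixed seed are assumptions, not details reported in the study.

```python
import random
from collections import defaultdict

def split_by_month(records, test_fraction=0.2, seed=0):
    """Randomly hold out `test_fraction` of the records of every month."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["month"]].append(rec)

    train, test = [], []
    for month in sorted(by_month):
        group = by_month[month][:]      # copy so the input is left untouched
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])     # 20% of this month goes to the test set
        train.extend(group[n_test:])    # the remaining 80% goes to training
    return train, test

# Hypothetical hourly records: 12 months with 100 observations each.
records = [{"month": m, "hour_index": h, "energy_kw": 0.0}
           for m in range(1, 13) for h in range(100)]
train, test = split_by_month(records)
print(len(train), len(test))  # 960 240
```

Because the hold-out is drawn separately for each month, every month contributes the same share to the test set.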

Tables 3–6 present the comparison of these methods for the
Tehran, Tabriz, Bandar Abbas and Yazd datasets. The comparison is
based on two metrics, R² and MSE. R² is the proportion of the
variance in the dependent variable that is predictable from the
independent variables, and it is a dimensionless quantity. High R²
values can arise in time series datasets where the
dependent and independent variables both have trends over time.
As our data are gathered over time with trends in some of the
features (e.g. ambient temperature), this could result in high R² values
in some of the prediction models. Therefore, MSE has been calculated
as an additional index for quantifying the performance of the
models. MSE is the average squared difference between
the estimated energy consumption values and the actual
values. The hourly energy consumption prediction is
expressed in kW; therefore, the MSE is expressed in (kW)².
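
For reference, the two metrics can be computed as in the minimal sketch below. The function names and the four sample values are illustrative assumptions; R² is computed here in its usual form, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mse(actual, predicted):
    """Mean squared error; for inputs in kW the result is in (kW)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (dimensionless)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical hourly consumption values in kW.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

print(mse(actual, predicted))                  # 0.5
print(round(r_squared(actual, predicted), 3))  # 0.9
```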

As can be seen, the prediction models perform differently 
depending on the climate zone. These differences show that the 
selection of a prediction model significantly depends on the his- 
torical data pattern. For instance, a highly efficient prediction 
model for a residential building cannot be generalized to other 
buildings, as the climate and other historical data are different. In 
most of the reviewed literature, a prediction model that fits well 
with a case study is generalized to a building type, such as com- 
mercial or residential. However, in addition to the building type, 
prediction model accuracy depends on the climate and the histor- 
ical data of the building. As a result, different prediction models 
should be utilized to select the most efficient model for each case 
study. This approach significantly increases the execution time, as it 
requires running a different prediction model with a large historical 
dataset for every case under study. In this study, the PCA method 
was proposed to reduce the execution time. The related results are 
presented in the following sections. 

Table 6 also shows that the random forest
method fits the test set better than the training set. In most
cases, test sets have a higher error than training sets;
however, it is entirely possible for a test set to have a higher R² value
than the training set, as Table 6 shows for the random forest
method. This usually happens when a model generalizes well
and/or when the training set is large but the test set is small. The Yazd
dataset has a wide range in comparison to the other cities, as can be
inferred from Figs. 8, 10 and 11. Therefore, the random forest model
trained on this dataset generalizes well, i.e., it
adapts properly to new, previously unseen data drawn
from the same distribution. As a result, the model fits the test set better
than the training set.


4.3. Performance validation of the proposed methodology 


To evaluate the performance of the PCA method, we undertook a 
comparison between using and not using this method for data 
reduction. For this purpose, three datasets were selected: 


1 Original building dataset (base case): Building raw data without 
any preprocessing were considered. These data cover all 20 
building features and climate conditions presented in Fig. 1. 
Fig. 20. Effectiveness of the data reduction methods as compared to the base case for the SVR (a), regression tree (b), random forest (c) and KNN (d) models. 


2 PCA-based reduced dataset: The least influential factors were
eliminated. This dataset includes all features except the common
features presented in Table 2.

3 Randomly reduced dataset: This dataset has the same number of
features as the PCA-based reduced dataset, but the features
were selected randomly. For instance, according to Table 2,
seven features were eliminated in the PCA-based data reduction
for the Tehran dataset. Accordingly, seven features
were randomly eliminated to generate the randomly reduced
dataset for Tehran.
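
The construction of the two reduced feature sets can be sketched as follows. The feature names are placeholders, and the seven names marked as removed by PCA are hypothetical; only the counts (20 features, seven eliminated for Tehran) come from the text.

```python
import random

def randomly_reduced(feature_names, n_removed, seed=0):
    """Drop `n_removed` features chosen uniformly at random."""
    rng = random.Random(seed)
    dropped = set(rng.sample(feature_names, n_removed))
    return [f for f in feature_names if f not in dropped]

# 20 placeholder feature names standing in for the base-case features.
features = [f"feature_{i}" for i in range(1, 21)]

# Hypothetical set of seven features flagged as least influential by PCA.
pca_removed = {"feature_3", "feature_5", "feature_8", "feature_11",
               "feature_14", "feature_17", "feature_20"}

pca_reduced = [f for f in features if f not in pca_removed]
random_reduced = randomly_reduced(features, n_removed=len(pca_removed))
print(len(pca_reduced), len(random_reduced))  # 13 13
```

Both reduced sets have the same size, so any difference in accuracy between them reflects which features were removed rather than how many.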


Fig. 21. One-day-ahead prediction of energy consumption in Tehran (a), Tabriz (b), Yazd (c) and Bandar Abbas (d) case studies. 
Fig. 22. Residual plot of energy consumption prediction models for Tehran (a), Tabriz (b), Yazd (c) and Bandar Abbas (d) case studies. 

Fig. 18(a), (b) and (c) present the execution time, MSE and R² of
the linear regression model for the three datasets, respectively. As
mentioned, the base case dataset involves 20 building features,
including climate and building characteristics. The PCA-based
reduced dataset considers all features except the common, less
important features shown in Table 2. 

The randomly reduced dataset contains the reduced data with
exactly the same number of features as in the PCA-based dataset. In
this dataset, the features are selected by the modeler, based
on expert judgment or randomly. Multiple studies,
such as (Li et al., 2009; Chou and Bui, 2014; Shao et al., 2020),
have selected a limited number of features to predict/model energy
consumption without reporting any analysis of
how the features were selected or why these features are the
most important ones. It is true that decreasing the number of features
results in a better execution time; however, it decreases
model accuracy as well. 

The rationale behind using the randomly reduced dataset is to
emphasize that it is preferable not to remove features without
first applying a data preprocessing and feature reduction method.
The results of this section show the importance of dimensionality
reduction methods. 

Table 7 
The MSE and R² for the most efficient prediction method for all case studies. 

Climate zone    Prediction model    MSE (kW)²   R²      Time (s)
Tehran          Random forest       0.23        0.99    4.91
Tabriz          Regression tree     1.02        0.98    1.01
Yazd            Random forest       0.21        0.99    4.57
Bandar Abbas    Regression tree     1           0.99    0.89

The MSE, R² and execution time are the factors that illustrate the
effectiveness of the data reduction process. Fig. 18(a) presents the 
execution time of the linear regression model for the three datasets 
and four climate zones. The reduction of the execution time is not 
the same for each climate zone. This finding demonstrates that, in 
addition to the prediction model, the data reduction method 
significantly depends on the data pattern. For instance, the 
execution time for all three Tehran datasets was almost identical, 
whereas for the Tabriz dataset the execution time was reduced by 
13.44%, and for Yazd and Bandar Abbas, it decreased more 
significantly. 

As indicated by Fig. 18(a), the data reduction methods reduced
the execution time. However, data reduction also decreased the
model performance. The performance of the prediction models is
measured by the MSE and R². As shown in Fig. 18(b), the error
increased significantly with the random data reduction method,
whereas it increased only slightly with the PCA-based data
reduction method. The R² value followed the same pattern: it
was slightly lower for the PCA-based data reduction method
and much lower for the random data reduction method. This result
shows the effectiveness of data reduction based on the PCA
method. 

Fig. 19 summarizes the effectiveness of the PCA-based and 
random data reduction methods in comparison with the base case 
for a linear regression model. The blue columns represent PCA-based 
data reduction, and the red columns show the random data reduc- 
tion results. As can be seen, the model performance was not the 
same for the different datasets (climate zones). Moreover, PCA-based 
data reduction had the strongest effect on the Yazd dataset when 
using a linear regression model, as it reduced the execution time by 
around 25%, with a 3% increase in the MSE and a 1% decrease in R².
For the random data reduction method, the MSE was 144.28% higher
and the R² was approximately 50% lower than with the base case. 

Fig. 20 presents the same comparison of datasets discussed 
above for the other prediction models, namely the SVR, regression 
tree, random forest and KNN models. The results indicate that the 
effectiveness of data reduction was significantly dependent on the 
prediction method. According to Fig. 20(a), for the SVR model, PCA-based
data reduction reduced the execution time by around 50%
with little effect on the MSE and R² values. 

Fig. 20(b) shows that with the regression tree method, the
execution time decreased slightly; however, the MSE increased
dramatically in percentage terms. This result was due to the low
baseline value of the MSE with
the regression tree method. As presented in Tables 3–6, the MSE for
the regression tree and random forest models was significantly
lower than for the linear regression, SVR and KNN models. As a
result, a slight absolute change in the MSE of the regression tree and random
forest models can result in a high percentage change, as illustrated
in Fig. 20(b) and (c). In contrast, the R² values did not change
notably. Therefore, the results suggest that PCA-based data reduction
could still yield an accurate regression tree model. However, the
execution time was only moderately reduced. 
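
The sensitivity of a percentage change to the size of the baseline can be illustrated with a short sketch. The MSE values below are hypothetical and chosen only to show the effect; they are not results from Tables 3–6.

```python
def percent_change(base, new):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base

# The same absolute MSE increase (0.2) applied to two hypothetical baselines.
small_base = percent_change(0.25, 0.45)   # low-MSE model, e.g. a tree-based one
large_base = percent_change(20.0, 20.2)   # high-MSE model

print(round(small_base, 1))  # about 80 percent
print(round(large_base, 1))  # about 1 percent
```

An identical absolute loss in accuracy therefore appears far larger in percentage terms for the model that was more accurate to begin with.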

Fig. 20(c) shows the effectiveness of the PCA-based data
reduction method for the random forest model. As can be seen, the
MSE values increased significantly for the same reasons noted
above for the regression tree model. However, for the random
forest method, the execution time was reduced by around 25%,
which is a remarkable value. Therefore, the PCA-based data
reduction method could be an effective reduction method for
random forest models. 

As illustrated in Fig. 20(d), the PCA-based data reduction 
method had a weak effect in terms of reducing the execution time 
for the KNN model. In addition, the MSE values increased, while the
R² values decreased significantly. As a result, it can be inferred that
PCA-based data reduction is not an effective solution for decreasing
the execution time for KNN models. Overall, the results support the
following points regarding PCA-based data reduction: 


a) The reduced dataset has a shorter execution time than the
original dataset due to the lower number of features. 

b) It has a higher accuracy (lower MSE and higher R²) than
randomly reduced datasets with the same number of features (and almost the
same execution time). 


4.4. Building energy consumption prediction 


This section describes the prediction of the energy consumption 
of the case studies on a day-ahead basis. Data were reduced based 
on the PCA method, and the prediction model was selected according
to the results in Section 4.3. Fig. 21 presents the prediction
of energy consumption for the last day of the year, represented with 
dashed red lines. The prediction is validated by results from the 
collected meteorological and energy data, represented by solid blue 
lines. 

According to Fig. 21, the energy consumption patterns in Tehran 
and Yazd are almost the same. This result is due to the mild and 
moderate weather in both cities. Therefore, the most efficient 
prediction method for these two cases is the random forest 
method. The main difference between these two cities is the peak 
level of energy consumption that occurs in the early morning. The 
peak level is higher in Tehran due to the lower ambient tempera- 
ture in this region. 

However, in Tabriz and Bandar Abbas, the energy consumption 
patterns are completely different. Tabriz has a cold climate; 
therefore, energy consumption on the last day of December is 
higher in comparison to the other cities. Unlike the other cities, Tabriz also
consumes energy outside the building's operating hours. This
difference is mainly due to the low ambient temperature in this 
city, which requires that the heating system start working from 
midnight (hour 0 on the graph) to keep the building conditions in 
the acceptable temperature range during working hours. In 
contrast, Bandar Abbas has a hot climate, and the daytime ambient 
temperature is high. Therefore, energy consumption is low during 
the day, but increases during hours 16–18 when there is insufficient
solar radiation and the ambient temperature is lower. 

In addition, the energy consumption profiles in all the case 
studies follow the occupancy profile presented in Fig. 9(a). Ac- 
cording to this occupancy profile, the office building is occupied 
from hours 7—18. Therefore, the electrical demands are highest 
during those hours. Furthermore, heating and ventilation systems 
should provide a comfort zone during working hours. To meet this 
criterion, heating and ventilation systems need to start working 
earlier, depending on the ambient conditions. 

Fig. 22 presents the standardized residual plot of the energy 
consumption prediction models for all cities. The plot shows the 
standardized residual (test prediction error) as a function of energy 
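consumption observation values. The standardized residual can
be calculated using Eq. (3). 

\[
\begin{aligned}
\text{Standardized residual}(i) &= \frac{\text{Residual}(i)}{\text{Standard deviation of residuals}} \\
\text{Residual}(i) &= \text{Actual value}(i) - \text{Predicted value}(i) \\
\text{Standard deviation of residuals} &= \sqrt{\frac{\sum_{i} \left(\text{Residual}(i) - \text{Mean of residuals}\right)^{2}}{\text{Number of tests}}}
\end{aligned}
\tag{3}
\]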
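
A minimal sketch of the standardized residual in Eq. (3) is given below, assuming the standard deviation is the population form (division by the number of test points). The function name and the sample values are illustrative only.

```python
import statistics

def standardized_residuals(actual, predicted):
    """Residual(i) divided by the standard deviation of all residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    # Population standard deviation: divides by the number of test points.
    sd = statistics.pstdev(residuals)
    return [r / sd for r in residuals]

# Hypothetical observed and predicted hourly consumption values in kW.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]

z = standardized_residuals(actual, predicted)
print([round(v, 3) for v in z])  # [-1.414, 0.0, 1.414, 0.0]
```

Plotting these values against the observed consumption gives a residual plot of the kind shown in Fig. 22.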


As can be seen, the pattern of energy consumption was pre- 
dicted with high accuracy, and the values were predicted with 
acceptable accuracy. The MSE and R² metrics serve to evaluate the
accuracy of the methods for all the case studies, as shown in Table 7.
The most efficient prediction methods are the random forest and
regression tree approaches, depending on the historical climate
data. A likely reason is that these two tree-based methods cope well
with datasets that have a large number of features. In addition, the
results imply that case studies with the same energy consumption 
patterns will have the same most efficient prediction methods. 


5. Conclusion 


In this paper, a hybrid PCA-based prediction method is proposed
to predict the energy consumption of buildings. As time passes, the historical
meteorological and operational data of a building grow
significantly. To develop an accurate energy consumption predic- 
tion model with a reasonable execution time, researchers should 
undertake data preprocessing. In this study, PCA is introduced as a 
data reduction method, and data preprocessing is performed for 
five prediction models including linear regression, SVR, regression 
tree, random forest and KNN. In addition, four types of datasets 
(four energy consumption patterns) are gathered to study the effect 
of the preprocessing method on the prediction models’ perfor- 
mance. The results indicate that the PCA method can be a useful 
data reduction approach that significantly reduces the execution 
time of energy consumption prediction models. 

In addition, according to the results, prediction model perfor- 
mance depends on the data pattern significantly, i.e., the best 
prediction model for cities with similar data patterns is the same. 
The PCA-based random forest model is proposed for Tehran and 
Yazd, and the best results are obtained from the regression tree for 
Tabriz and Bandar Abbas, with an MSE of about 1 or lower and an R² of
around 0.99. The results suggest that this approach could be applied
to other energy prediction models with large datasets, resulting
in an accurate prediction with a significantly reduced execution
time. This methodology could be beneficial in online performance 
monitoring, failure diagnosis and optimization systems that require 
highly efficient prediction models with low execution time. 